proper name
On the Semantics of Large Language Models
Large Language Models (LLMs) such as ChatGPT demonstrated the potential to replicate human language abilities through technology, ranging from text generation to engaging in conversations. However, it remains controversial to what extent these systems truly understand language. We examine this issue by narrowing the question down to the semantics of LLMs at the word and sentence level. By examining the inner workings of LLMs and their generated representation of language and by drawing on classical semantic theories by Frege and Russell, we get a more nuanced picture of the potential semantic capabilities of LLMs.
- Europe > France (0.06)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Indiana > Monroe County > Bloomington (0.04)
- Europe > Switzerland > Zürich > Zürich (0.04)
PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model
This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme conversion model designed for multilingual name transliteration, onomastic research, and information retrieval. The model leverages two helper models developed for data augmentation: IPA2vec for finding soundalikes across languages, and similarIPA for handling phonetic notation variations. Evaluated on a test set that spans multiple languages and writing systems, the model achieves a mean Character Error Rate of 0.055 and a character-level BLEU score of 0.914, with particularly strong performance on languages with shallow orthographies. The implementation of beam search further improves practical utility, with top-3 candidates reducing the effective error rate by 52.7\% (to CER: 0.026), demonstrating the model's effectiveness for cross-linguistic applications.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Croatia > Zagreb County > Zagreb (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer
Lauc, Davor, Rutherford, Attapol, Wongwarawipatr, Weerin
This study introduces AyutthayaAlpha, an advanced transformer-based machine learning model designed for the transliteration of Thai proper names into Latin script. Our system achieves state-of-the-art performance with 82.32% first-token accuracy and 95.24% first-three-token accuracy, while maintaining a low character error rate of 0.0047. The complexity of Thai phonology, including tonal features and vowel length distinctions, presents significant challenges for accurate transliteration, which we address through a novel two-model approach: AyutthayaAlpha-Small, based on the ByT5 architecture, and AyutthayaAlpha-VerySmall, a computationally efficient variant that unexpectedly outperforms its larger counterpart. Our research combines linguistic rules with deep learning, training on a carefully curated dataset of 1.2 million Thai-Latin name pairs, augmented through strategic upsampling to 2.7 million examples. Extensive evaluations against existing transliteration methods and human expert benchmarks demonstrate that AyutthayaAlpha not only achieves superior accuracy but also effectively captures personal and cultural preferences in name romanization. The system's practical applications extend to cross-lingual information retrieval, international data standardization, and identity verification systems, with particular relevance for government databases, academic institutions, and global business operations. This work represents a significant advance in bridging linguistic gaps between Thai and Latin scripts, while respecting the cultural and personal dimensions of name transliteration.
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.67)
Killing Two Flies with One Stone: An Attempt to Break LLMs Using English->Icelandic Idioms and Proper Names
Ármannsson, Bjarki, Hafsteinsson, Hinrik, Jasonarson, Atli, Steingrímsson, Steinþór
This paper presents the submission of the \'Arni Magn\'usson Institute's team to the WMT24 test suite subtask, focusing on idiomatic expressions and proper names for the English->Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for modern translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions, as well as testing whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readability. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Europe > Finland > Southwest Finland > Turku (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (11 more...)
Resolving Regular Polysemy in Named Entities
Hsieh, Shu-Kai, Tseng, Yu-Hsiang, Chou, Hsin-Yu, Yang, Ching-Wen, Chang, Yu-Yun
Word sense disambiguation primarily addresses the lexical ambiguity of common words based on a predefined sense inventory. Conversely, proper names are usually considered to denote an ad-hoc real-world referent. Once the reference is decided, the ambiguity is purportedly resolved. However, proper names also exhibit ambiguities through appellativization, i.e., they act like common words and may denote different aspects of their referents. We proposed to address the ambiguities of proper names through the light of regular polysemy, which we formalized as dot objects. This paper introduces a combined word sense disambiguation (WSD) model for disambiguating common words against Chinese Wordnet (CWN) and proper names as dot objects. The model leverages the flexibility of a gloss-based model architecture, which takes advantage of the glosses and example sentences of CWN. We show that the model achieves competitive results on both common and proper nouns, even on a relatively sparse sense dataset. Aside from being a performant WSD tool, the model further facilitates the future development of the lexical resource.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Taiwan > Taiwan Province > Taipei (0.04)
- Asia > Taiwan > Taiwan Province > Keelung (0.04)
- (18 more...)
Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition
Sahai, Saumya Y., Liu, Jing, Muniyappa, Thejaswi, Sathyendra, Kanthashree M., Alexandridis, Anastasios, Strimel, Grant P., McGowan, Ross, Rastrow, Ariya, Chang, Feng-Ju, Mouchtaris, Athanasios, Kunzmann, Siegfried
We present dual-attention neural biasing, an architecture designed to boost Wake Words (WW) recognition and improve inference time latency on speech recognition tasks. This architecture enables a dynamic switch for its runtime compute paths by exploiting WW spotting to select which branch of its attention networks to execute for an input audio frame. With this approach, we effectively improve WW spotting accuracy while saving runtime compute cost as defined by floating point operations (FLOPs). Using an in-house de-identified dataset, we demonstrate that the proposed dual-attention network can reduce the compute cost by $90\%$ for WW audio frames, with only $1\%$ increase in the number of parameters. This architecture improves WW F1 score by $16\%$ relative and improves generic rare word error rate by $3\%$ relative compared to the baselines.
Novel Aficionados and Doppelg\"angers: a referential task for semantic representations of individual entities
Bruera, Andrea, Herbelot, Aurélie
In human semantic cognition, proper names (names which refer to individual entities) are harder to learn and retrieve than common nouns. This seems to be the case for machine learning algorithms too, but the linguistic and distributional reasons for this behaviour have not been investigated in depth so far. To tackle this issue, we show that the semantic distinction between proper names and common nouns is reflected in their linguistic distributions by employing an original task for distributional semantics, the Doppelg\"anger test, an extensive set of models, and a new dataset, the Novel Aficionados dataset. The results indicate that the distributional representations of different individual entities are less clearly distinguishable from each other than those of common nouns, an outcome which intriguingly mirrors human cognition.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Europe > France (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report (0.64)
- Overview (0.47)
Testing the Quantitative Spacetime Hypothesis using Artificial Narrative Comprehension (II) : Establishing the Geometry of Invariant Concepts, Themes, and Namespaces
Given a pool of observations selected from a sensor stream, input data can be robustly represented, via a multiscale process, in terms of invariant concepts, and themes. Applying this to episodic natural language data, one may obtain a graph geometry associated with the decomposition, which is a direct encoding of spacetime relationships for the events. This study contributes to an ongoing application of the Semantic Spacetime Hypothesis, and demonstrates the unsupervised analysis of narrative texts using inexpensive computational methods without knowledge of linguistics. Data streams are parsed and fractionated into small constituents, by multiscale interferometry, in the manner of bioinformatic analysis. Fragments may then be recombined to construct original sensory episodes---or form new narratives by a chemistry of association and pattern reconstruction, based only on the four fundamental spacetime relationships. There is a straightforward correspondence between bioinformatic processes and this cognitive representation of natural language. Features identifiable as `concepts' and `narrative themes' span three main scales (micro, meso, and macro). Fragments of the input act as symbols in a hierarchy of alphabets that define new effective languages at each scale.
- North America > United States > New York (0.04)
- South America > Colombia (0.04)
- North America > United States > Illinois (0.04)
- (4 more...)
Multilingual person name recognition and transliteration
Pouliquen, Bruno, Steinberger, Ralf, Ignat, Camelia, Temnikova, Irina, Widiger, Anna, Zaghouani, Wajdi, Zizka, Jan
We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing system. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of the news analysis system NewsExplorer that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages.
- Asia > Russia (0.28)
- Europe > United Kingdom (0.14)
- Europe > France (0.14)
- (14 more...)
- Leisure & Entertainment > Sports (1.00)
- Government > Regional Government > Europe Government (0.93)
- Government > Regional Government > North America Government > United States Government (0.46)
How Controlled English can Improve Semantic Wikis
The motivation of semantic wikis is to make acquisition, maintenance, and mining of formal knowledge simpler, faster, and more flexible. However, most existing semantic wikis have a very technical interface and are restricted to a relatively low level of expressivity. In this paper, we explain how AceWiki uses controlled English -- concretely Attempto Controlled English (ACE) -- to provide a natural and intuitive interface while supporting a high degree of expressivity. We introduce recent improvements of the AceWiki system and user studies that indicate that AceWiki is usable and useful.